Developing Data Scientists: Exploring Free Code Camp’s “2016 New Coder Survey”

Structure of Dataset

The original “2016 New Coder Survey” dataset consists of 113 variables. Most of these variables are answers to survey questions, though a few are computer-generated (e.g. respondent ID and survey start/end times). Over 15,000 observations (i.e. respondents) exist.

The str function output is long and messy, so I won’t print it here. Please consult Free Code Camp’s list of survey questions and possible answers. Boolean, numeric, and categorical types are the majority.

New Variables

I created six new variables from existing variables:

  • ContinentCitizen and ContinentLive from CountryCitizen and CountryLive using Vincent Arel-Bundock’s countrycode R package
  • PodcastPartiallyDerivative, PodcastBecomingDataSci, and PodcastTalkingMachines from PodcastOther using ifelse statements
  • HoursLearningBucket using the cut function on HoursLearning

These new variables bring our total to 119 variables.

## [1] 15620   119
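
The ifelse and cut derivations listed above can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the original code: the toy data frame, the grepl patterns, and the choice of two podcasts are assumptions, while the bucket breaks follow the (0,10], (10,20], (20,40], (40,80] labels that appear later in this write-up.

```r
# Toy rows standing in for the survey; all values are invented for illustration.
survey <- data.frame(
  PodcastOther  = c("Partially Derivative", NA,
                    "talking machines, others", "Becoming a Data Scientist"),
  HoursLearning = c(5, 15, 30, 60),
  stringsAsFactors = FALSE
)

# Flag a podcast when its name shows up in the free-text answer.
# The matching patterns are assumptions; grepl() returns FALSE for NA input.
survey$PodcastPartiallyDerivative <- ifelse(
  grepl("partially derivative", survey$PodcastOther, ignore.case = TRUE), 1, 0
)
survey$PodcastTalkingMachines <- ifelse(
  grepl("talking machines", survey$PodcastOther, ignore.case = TRUE), 1, 0
)

# Bucket weekly learning hours into right-closed intervals.
# A value of exactly 0 falls outside (0,10] and becomes NA
# unless include.lowest = TRUE is passed.
survey$HoursLearningBucket <- cut(survey$HoursLearning,
                                  breaks = c(0, 10, 20, 40, 80))

table(survey$HoursLearningBucket)
```

Right-closed intervals are the cut default, which is why the bucket labels read (0,10] rather than [0,10).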

Data Science/Engineering Subset

646 respondents answered “Data Scientist/Data Engineer” to the question: “Which one of these roles are you most interested in?”

## [1] 646 119

The following analysis explores the characteristics of these developing data scientists and engineers.

Additional comments are included where the results significantly differ from the full new coder survey dataset.

The univariate section mimics the structure of Free Code Camp’s Medium article for direct comparison of data science/engineering students and new coders in general. A few additional univariate plots are included to smooth the transition to the plots explored in the bivariate and multivariate sections.


Univariate Plots

Who Participated

CodeNewbie and Free Code Camp designed the survey, and dozens of coding-related organizations publicized it to their members.

Of the 646 developing data scientists and data engineers who responded to the survey:

A quarter are women.

##    female 
## 0.2447917

Their median age is 26.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   14.00   22.00   26.00   27.72   31.25   65.00      74

They started programming an average of 16 months ago.

This average is 5 months longer than that of the full survey dataset.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    3.00    8.00   16.17   20.00  360.00      31

A log transformation of the long-tailed data makes the distribution easier to read: programming experience peaks around one year.
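
As a rough sketch of that transformation, assuming a base-10 log and made-up month values (the variable names and bin edges here are illustrative, not taken from the original analysis):

```r
# Invented months-of-programming values with a long right tail.
months <- c(0, 1, 3, 8, 12, 20, 36, 120, 360)

# log10(0) is -Inf, so zero-month respondents are dropped before transforming.
positive   <- months[months > 0]
log_months <- log10(positive)

# Bin on the log scale: each unit is a tenfold increase in months.
# plot = FALSE returns the counts; call plot(h) to draw the histogram.
h <- hist(log_months, breaks = seq(0, 3, by = 0.5), plot = FALSE)
h$counts
```

On the log scale, 3 months and 25 years (300 months) sit only two units apart, which is what makes the two ends of the distribution comparable in one plot.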

Learner Goals and Approaches

The average respondent dedicates 14 hours per week to learning.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00   10.00   14.41   20.00   80.00      30

No respondents want to freelance or start their own business.

Compared to 40% for the full new coder survey, this is a bit shocking, but understandable given the demand for data scientists and engineers in industry.

52% are already applying for jobs, or will start applying within the next year.

The data-related subset has a longer time horizon than the full survey dataset, where 65% are applying within the next year.

Most of them want to work in an office, as opposed to remotely.

And a majority are willing to relocate.

Most of them have not yet attended any in-person coding events.

On average, they use at least three different resources for learning.

The developing data scientists/engineers use Coursera, edX, and Udacity more frequently than new coders in general. These companies have wider subject-area scopes than some of the programming-specific resources listed.

Only 11 of the 954 respondents (1%) who attended a bootcamp in the full survey are pursuing data science/engineering.

6% of new coders from the full survey dataset have attended a bootcamp.

Demographics and Socioeconomics

Data-focused respondents represent 166 countries.

More than 90% are from North America, Europe, and Asia.

Their cities span a wide range of urbanization levels.

Just under a quarter of respondents are ethnic minorities in their country.

And nearly half are non-native English speakers. They grew up speaking one of 148 languages.

67% have earned at least a bachelor’s degree.

Compared to 58% for the full new coder survey, the data-focused subset is more skewed towards graduate studies.

Just over one half are currently working.

Two thirds of new coders in general are currently working.

A quarter work in the tech industry.

Employment fields are more spread compared to the full new coder survey, where 50% of respondents work in software development and IT.

Median current salary is $44k.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   25000   43600   48420   60000  200000     390

And they expect to earn a median of $60k with their new data science/engineering skills.

With data science/engineering being notoriously lucrative in 2016, some respondents are likely seeking higher wages.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   40000   60000   61110   80000  200000      65

7% have served in their country’s military.

## has served in military 
##             0.06501548

13% have children, and another 3% financially support an elderly or disabled relative. And one fifth of them are doing this without the help of a spouse.

## has children 
##    0.1346749
## financially supporting 
##             0.03250774
## no spouse 
## 0.2137405

47% consider themselves underemployed (working a job that is below their education level).

## is underemployed 
##        0.4705882

If they have a home mortgage, they owe an average of $194k.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   76000  150000  194400  240000 1000000     591

If they have student loans, they owe an average of $37k.

This average is $3k more than that of the full survey dataset.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   10000   20000   36880   45000 1000000     485

With the million-dollar outlier removed, the distribution is much clearer, with the majority of debt under $75k. I hope that outlier is a joke.

14% don’t yet have high speed internet at home.

## has high speed internet 
##               0.8573913

And 3% are currently receiving disability benefits from their government.

## is receiving disability benefits 
##                       0.02608696

Univariate Analysis

What is/are the main feature(s) of interest in your dataset?

There isn’t really a singular main feature of interest in the “2016 New Coder Survey” dataset. There are several smaller features, but nothing stands out like diamond price and its relationship to carat weight, cut, colour, etc. in the R diamonds dataset, for example. The diamonds dataset covers two time periods (the existence of the diamond pre-sale and post-sale), whereas the survey dataset only covers a single period: the early stages of an individual’s coding career.

If we could fast-forward several years and survey the same respondents, the main feature of interest might be career earnings (adjusted for cost of living, preferably) and/or self-reported career satisfaction. A predictive model using a combination of variables from the 2016 survey could then be built to estimate career success.

If the survey asked “Are you already working as a data scientist/engineer?” instead of “Are you already working as a software developer?”, that variable might also be a main feature of interest. Unfortunately, the answer to that question cannot be extracted from the existing variables.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Though there isn’t a main feature of interest, we can separate the respondents who did not answer “Data Scientist/Data Engineer” to the job role interest question (as we already have for those who did) and compare the two subsets using bivariate and multivariate plots.

I will also explore two smaller features, how many hours dedicated to learning per week and expected salary, using bivariate and multivariate plots.

Of the features you investigated, were there any unusual distributions?

There was a lot of long-tailed data. Most did not require transformation to view the details of the distribution. Programming experience was strongly positively skewed, however, and required a log transformation to visually compare those with 3 months of experience to those with 25 years.

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The following operations were performed to tidy, adjust, or change the form of the data:

  • Each code event, resource, and podcast is represented by a boolean variable. I summed the number of yeses for each event/resource/podcast, which created a single row of sums. I used tidyr’s gather() to transform the data from a wide format to a long format. Then I transformed the long data into factor format, using the replicate function with the number of yeses as the multiplier. This data is used to create each category’s bar chart.
  • After subselecting all code event, resource, and podcast columns separately, I created a new boolean variable named answered, where 1 represents using at least one event/resource/podcast and 0 represents using none. The answered sum total for each of the three categories is used in the “x out of 646 developing data scientists/engineers answered” label at the bottom of the bar charts above.
  • I separated data science/engineering-related podcasts in the user-inputted PodcastOther category into their own boolean variables.
  • I changed “NA” in the EmploymentStatus variable to “other” if the respondent provided the user-inputted EmploymentStatusOther variable.
  • I changed “NA” in the EmploymentField variable to “other” if the respondent provided the user-inputted EmploymentFieldOther variable.
  • I separated the “Americas” continents outputted by countrycode() into North and South America.

The first five operations were performed so bar charts could be created, which wasn’t possible with the original data format. The “Americas” separation was performed for additional insight.
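
The wide-to-long-to-factor reshaping, the answered flag, and the “NA” to “other” recode described above might look something like this. It is a base R sketch under stated assumptions: the column names and values are hypothetical, a plain data frame stands in for tidyr’s gather(), and rep() is used for the repetition step, which may differ from the original call.

```r
# Hypothetical boolean resource columns (1 = yes, NA = not selected).
resources <- data.frame(
  ResourceCoursera = c(1, NA, 1, 1),
  ResourceEdX      = c(NA, NA, 1, NA),
  ResourceUdacity  = c(1, NA, NA, NA)
)

# Wide: a single row of sums, one count of yeses per resource.
yes_counts <- colSums(resources, na.rm = TRUE)

# Long: one row per resource with its count (base R stand-in for gather()).
long <- data.frame(resource = names(yes_counts), n = as.vector(yes_counts),
                   stringsAsFactors = FALSE)

# Factor: repeat each resource name n times so a bar chart can count levels.
resource_factor <- factor(rep(long$resource, times = long$n))
table(resource_factor)

# answered: 1 if the respondent selected at least one resource, else 0.
answered <- ifelse(rowSums(resources, na.rm = TRUE) > 0, 1, 0)
sum(answered)  # feeds the "x out of N answered" label

# Recode a missing status to "other" only when the free-text field was filled in.
employment <- data.frame(
  EmploymentStatus      = c("Employed for wages", NA, NA),
  EmploymentStatusOther = c(NA, "Volunteering", NA),
  stringsAsFactors = FALSE
)
fill <- is.na(employment$EmploymentStatus) &
  !is.na(employment$EmploymentStatusOther)
employment$EmploymentStatus[fill] <- "other"
```

Respondents who left both employment fields blank stay NA, so “other” only ever reflects an answer that was actually given.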


Bivariate Plots

14,974 respondents did not answer “Data Scientist/Data Engineer” to the question: “Which one of these roles are you most interested in?”

## [1] 14974   119

The next two plots are created using pairs.panels() from the psych package. They display a scatter plot matrix (SPLOM), with bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlation above the diagonal.

For the data science subset of the survey, all correlations are below 0.4, which supports my statement that no main feature exists. The strongest of the correlations are:

  • Age and Income (0.30)
  • Income and ExpectedEarning (0.36)
  • Income and StudentDebtOwe (0.34)

The phenomena revealed are intuitive, but not groundbreaking: you tend to make more money when you are older, you tend to expect your next job to have a high salary if your current one does, and expensive schooling tends to lead to higher income levels.
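
The Pearson correlations shown above the diagonal can be approximated numerically with base R’s cor(). The sketch below uses invented values and assumes pairwise-complete observations for handling NAs, which may not match exactly how the plots were produced.

```r
# Invented values for three of the numeric survey variables.
subset_ds <- data.frame(
  Age             = c(22, 25, 31, 40, NA, 28),
  Income          = c(20000, 30000, 45000, 70000, 50000, NA),
  ExpectedEarning = c(40000, 50000, 60000, 80000, 65000, 55000)
)

# Pearson correlation for every pair of columns, using whichever rows
# have both values present (an assumption about the NA handling).
cors <- cor(subset_ds, use = "pairwise.complete.obs", method = "pearson")
round(cors, 2)
```

The resulting matrix is symmetric with ones on the diagonal, so only the upper triangle carries information, which is why a SPLOM can use the lower triangle for scatter plots.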

For the non-data science subset of the survey, all correlations are again below 0.4. Most of the correlations are within 0.1 of the data science subset, except for three:

  • Age and StudentDebtOwe (0.24 - 0.10 = 0.14)
  • MonthsProgramming and StudentDebtOwe (-0.07 - 0.09 = -0.16)
  • Income and StudentDebtOwe (0.34 - 0.08 = 0.26)
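
The subtraction in the list above can be checked with a few lines of R. The correlation values are copied from the list; the vector names are shorthand introduced here for illustration.

```r
# Correlations with StudentDebtOwe, as reported above.
data_science     <- c(Age = 0.24, MonthsProgramming = -0.07, Income = 0.34)
non_data_science <- c(Age = 0.10, MonthsProgramming = 0.09,  Income = 0.08)

# Difference between subsets, and whether it exceeds the 0.1 threshold.
diffs <- data_science - non_data_science
round(diffs, 2)
abs(diffs) > 0.1
```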

Interesting. I bet the skew towards graduate studies for the data science subset plays a role here, where higher levels of student debt and higher salaries are expected.

Let’s return to the data science subset of the survey. One of the strongest correlations is between age and current salary.

The earnings vs. age trend isn’t maintained as these individuals prepare to transition to the data science/engineering field. Younger individuals appear willing to capitalize on lucrative data-related salaries and older individuals appear willing to take a pay cut to enter their new field of choice.

The variables on the x-axis in the boxplots below are in descending order in terms of number of respondents.

Since there are only two agender, three genderqueer, and two trans respondents, and males represent 75% of the subset, we can’t say much about who is most dedicated to learning. The medians for males and females (10 hours per week) are identical.

## Gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.00   10.00   15.09   20.00   80.00      12 
## -------------------------------------------------------- 
## Gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00   10.00   12.15   15.00   80.00      11 
## -------------------------------------------------------- 
## Gender: genderqueer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   13.50   20.00   25.67   35.00   50.00 
## -------------------------------------------------------- 
## Gender: agender
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0     4.5     7.0     7.0     9.5    12.0 
## -------------------------------------------------------- 
## Gender: trans
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2       9      16      16      23      30
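
Grouped summaries in this format, with groups ordered by descending respondent count, can be produced along these lines. The rows below are invented and the approach is only one plausible way to get this output, not necessarily the one used for the output above.

```r
# Invented rows; the real subset has 646 respondents.
ds <- data.frame(
  Gender        = c("male", "female", "male", "male", "female", "agender"),
  HoursLearning = c(10, 5, 20, NA, 15, 7),
  stringsAsFactors = FALSE
)

# Order the factor levels by descending respondent count so that
# boxplots and summaries list the largest groups first.
counts    <- sort(table(ds$Gender), decreasing = TRUE)
ds$Gender <- factor(ds$Gender, levels = names(counts))

# One summary() per group, printed in level order.
by(ds$HoursLearning, ds$Gender, summary)

# Group medians alone, ignoring missing answers.
medians <- tapply(ds$HoursLearning, ds$Gender, median, na.rm = TRUE)
medians
```

With only one to three respondents in a group, the quartiles are interpolated between a handful of values, which is why the smallest groups cannot support conclusions.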

The medians for five of the six continents are identical (10 hours per week). The bulk of Asian students appear most dedicated to learning, with their 75th percentile approaching 25 hours per week. Africa may be suffering from a small sample size issue with only 11 respondents.

## ContinentCitizen: North America
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00   10.00   14.39   20.00   80.00      12 
## -------------------------------------------------------- 
## ContinentCitizen: Europe
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    6.00   10.00   14.41   20.00   50.00       4 
## -------------------------------------------------------- 
## ContinentCitizen: Asia
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00   10.00   15.47   23.75   56.00       4 
## -------------------------------------------------------- 
## ContinentCitizen: Oceania
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0     5.0    10.0    11.6    15.0    42.0 
## -------------------------------------------------------- 
## ContinentCitizen: South America
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    4.00    7.75   10.00   15.06   16.25   40.00       1 
## -------------------------------------------------------- 
## ContinentCitizen: Africa
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.500   6.000   8.727  15.000  21.000

Again, male and female medians are identical. They both expect around a $60k data science/engineering salary. There is a gap in first quartiles, however, as females expect $10k more than males.

## Gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   40000   60000   60770   80000  200000      45 
## -------------------------------------------------------- 
## Gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   50000   60000   61560   80000  150000      14 
## -------------------------------------------------------- 
## Gender: genderqueer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48000   54000   60000   59330   65000   70000 
## -------------------------------------------------------- 
## Gender: agender
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30000   37500   45000   45000   52500   60000 
## -------------------------------------------------------- 
## Gender: trans
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   65000   66250   67500   67500   68750   70000

Whoa. Expected earning by continent varies far more than in the above three boxplots. North Americans expect the highest range of salaries, with their interquartile range spanning from $55k to $80k. The 75th percentile for Europe is $5k below North America’s 25th percentile. I wonder if some European respondents forgot to convert from pounds or euros to USD. Expectations in Asia are all over the map.

A lot of these individuals are using similar, if not the same, online educational resources. Labour market economics can be cruel.

## ContinentCitizen: North America
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   12000   55000   65000   68420   80000  200000      22 
## -------------------------------------------------------- 
## ContinentCitizen: Europe
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   24000   40000   41960   50000  120000      23 
## -------------------------------------------------------- 
## ContinentCitizen: Asia
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   20000   50000   55470   86300  150000       7 
## -------------------------------------------------------- 
## ContinentCitizen: Oceania
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   10000   40000   65000   65290   75000  160000       3 
## -------------------------------------------------------- 
## ContinentCitizen: South America
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   12000   33000   45000   48730   55000  100000       2 
## -------------------------------------------------------- 
## ContinentCitizen: Africa
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   40000   62500   77500   85300   87500  200000       1

Salary expectations don’t vary much depending on hours dedicated to learning. Other than those who dedicate 40+ hours per week, an expected salary in the $40k to $80k range is standard.

## 
##  (0,10] (10,20] (20,40] (40,80] 
##     351     136     101      19
## HoursLearningBucket: (0,10]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   40000   60000   61700   80000  200000      28 
## -------------------------------------------------------- 
## HoursLearningBucket: (10,20]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   40000   60000   58380   75000  120000      10 
## -------------------------------------------------------- 
## HoursLearningBucket: (20,40]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   40000   60000   58130   75000  200000       8 
## -------------------------------------------------------- 
## HoursLearningBucket: (40,80]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20000   50000   60000   66740   85000  135000

Bivariate Analysis

How did the feature(s) of interest vary with other features in the dataset?

The data science/engineering subset of the survey is largely similar to the non-data science/engineering subset, except for three correlations involving student debt owed. I bet this has something to do with the skew towards graduate studies for the data-focused subset.

The correlation between age and current salary is stronger than the correlation between age and expected earning for the individual’s first data science/engineering job.

Hours dedicated to learning per week doesn’t appear to vary much with gender or continent, though sample size issues exist.

Expected earning varies strongly by continent. Females also appear to have a higher bottom line for expected salary than males. Those who dedicate more than 40 hours a week to learning data science/engineering appear to expect higher salaries as well.

What was the strongest relationship you found?

For both subsets of the survey, there is no exceedingly strong relationship. All correlations are below 0.4.

Current salary and expected salary for an individual’s first data-related job have the strongest relationship for both subsets, with correlations of 0.36 and 0.38.


Multivariate Plots

Let’s dig deeper into the two strongest correlations, income against expected salary and student debt owed, using multivariate analysis.

The male/female wage gap is evident through each gender’s presence above the $100k lines. There aren’t enough data points for genderqueer and trans individuals to draw conclusions.

Ethnic minorities appear to be optimistic about the changing diversity landscape via their expected salaries. They have a notable presence above the $100k expected earning line, but not the $100k current salary line.

Current salaries and student debt levels for graduate students are relatively high, as expected. Bachelor’s degrees appear to have the worst student debt/current salary balance.

It appears that the student debt remaining vs. current salary relationship doesn’t differ much across hours dedicated to learning brackets.

Multivariate Analysis

Were there any interesting or surprising interactions between features?

It’s interesting that females expect their next salary to be as relatively low as their current one, but ethnic minorities expect a higher salary.

I’m also surprised more individuals with high levels of student debt aren’t dedicating 20+ hours to learning each week. I bet the current jobs of the individuals in lower time brackets are preventing them from increasing their pace.

The affordability of quality education online is a huge reason why I’m in the 40-80 hour bracket for my personalized data science master’s degree.


Final Plots

Plot One

Description One

Males and non-minorities appear most frequently above the $100k lines. The wage gap is evident in current salary for both females and minorities. Though females appear to expect lower salaries than men, minorities are better represented above the $100k expected earning line.

Higher dispersion exists for the majority demographic in both cases. The relationship between expected and current salary is much stronger for the minority demographic.

Plot Two

Description Two

The majority of individuals who pursued post-secondary education are above the $25k student debt remaining line. Compared to the data science/engineering subset, the lack of correlation between student debt and current salary for the full survey dataset now makes sense. The aforementioned skew towards graduate studies appears to instead be a skew towards post-secondary studies in general, however.

Plot Three

Description Three

The highest proportion of individuals above the $50k current salary line belongs to the 0-10 hours dedicated to learning bracket. Proportions of individuals above the $25k student debt remaining line are similar across brackets.


Reflection

Developing data scientists and engineers differ slightly from new coders in general.

  • They have programmed for longer.
  • They want to work for developed companies, rather than freelance or create their own.
  • They have a longer job search time horizon.
  • They use Coursera, edX, and Udacity more frequently.
  • They use bootcamps less frequently.
  • They have completed higher levels of education.
  • They come from a wider subject area background.
  • Fewer are currently working.
  • Fewer work in the tech industry.
  • They have more student debt.

The two datasets do share plenty of common trends. Demographics are similar. Most are willing to relocate. Most don’t use podcasts or attend events yet.

Diversity is still an issue in the workplace, as reflected in current and expected salary for females and ethnic minorities. Student debt owed matches well with current salary and higher levels of education. Most people aren’t replacing the traditional college/university route with full-time online education…yet.

The successes of this exploration are largely due to the detailed design of the Free Code Camp survey.

The main struggle I encountered in this exploration was the lack of a main feature of interest, like the diamonds dataset’s price variable. It would be awesome if we could survey the same respondents in a decade or so. We could combine career earnings and career satisfaction with the 2016 survey’s results to build a predictive model to estimate career success.

These are the people who are learning data science and engineering. It is clear that free, self-paced learning resources are important.